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Abstract 

We adapt the alternating linearization method for proximal decomposition to structured regularization 
problems, in particular, to the generalized lasso problems. The method is related to two well-known 
operator splitting methods, the Douglas-Rachford and the Peaceman-Rachford method, but it has descent 
properties with respect to the objective function. Its convergence mechanism is related to that of bundle 
methods of nonsmooth optimization. We also discuss implementation for very large problems, with the 
use of specialized algorithms and sparse data structures. Finally, we present numerical results for several 
synthetic and real-world examples, including a three-dimensional fused lasso problem, which illustrate 
the scalability, efficacy, and accuracy of the method. 
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1 Introduction 



Regularization techniques that encourage sparsity in parameter estimation have gained increasing popularity 



recently. The most widely used example is lasso (Tibshirani 1996 ), where the loss function /(•) is penalized 
by the i\-norm of the unknown coefficients ft G W, to form a modified objective function, 



£(J3) = f(P) + \\\P\\ 1 , A>0, 



(1) 



in order to shrink irrelevant coefficients to zero. Many efficient algorithms have been proposed to solve this 
problem, including ( [Ful[T9"9"8j|Daubechies et"aL|[2004tpfron et al.||2(K%) and ( |Friedman efal|[2007l ). Some 
of them are capable of handling massive data sets with tens of thousands of variables and observations. 

For many practical applications, physical constraints and domain knowledge may mandate additional 
structural constraints on the parameters. For example, in cancer research, it may be important to consider 
groups of interacting genes in each pathway rather than individual genes. In image analysis, it is natural 
to regulate the differences between neighboring pixels in order to achieve smoothness and reduce noise. In 
light of these popular demands, a variety of structured penalties have been proposed to incorporate prior 



information. One of the most important structural penalties is the fused lasso proposed in (Tibshirani et al. 



2005). It utilizes the natural ordering of input variables to achieve parsimonious parameter estimation on 
neighboring coefficients. Beck and Teboulle (2009]) adopt the total variation penalty for image denoising and 
deblurring, in a similar fashion to the two-dimensional fused lasso. Chen et al. ( 2010| ) proposed the graph 
induced fused lasso that penalizes differences between coefficients associated with nodes that are connected. 



A more general structural lasso framework was proposed in (Tibshirani and Taylor] |2011| ), with the following 
form: 

Ctf) = ftf) + \\\RP\\i, A>0, (2) 

where R is an m x p matrix that defines the structural constraints one wants to impose on the coefficients. 
Many regularization problems, including fused lasso and graph induced fused lasso, can be cast in this 
framework. 

When the structural matrix R is relatively simple, as in the original lasso case with R = I, traditional 
path algorithms and coordinate descent techniques can be used to solve the optimization problem efficiently 



( Friedma n etal. 1 2007 ). For more complex structural regularization, these methods cannot be directly applied. 
One of the key difficulties is the non-separability of the nonsmooth penalty function. Coordinate descent 
methods fail to converge under this circumstances ( |Tseng 2001 1. Generic solvers, such as interior point 
methods, can sometimes be used; unfortunately they become increasingly inefficient for large size problems, 
particularly when the design matrix is ill-conditioned ( jChen et aL| 201 1| ). 

In the past two years, many efforts have been devoted to developing efficient optimization techniques 
for solving regularization problems using structured penalties. |Liu et al.] ( |2010[ ) developed a first order and 
a split Bregman scheme, respectively, for solving similar class of problems. In many practical studies, the 
convergence of these two methods can not be guaranteed. Chen et al. ( 2011| ) proposed a modified proximal 
technique for the general structurally penalized problems. It is based on a first order approximation of the 
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nonsmooth penalty function, which can become unstable when dimension is high. Meanwhile, several path 
algorithms have also been proposed to compute the whole regularization path for the general fused lasso 



problem. Hoefling (2010) developed a path algorithm for solving (|2j) when the matrix X T X is nonsingular. 
This technique is not applicable to cases with large dimension of (3 and small number of observations, such 
as gene expression and brain imaging analysis. Tibshirani and Taylor] ( [201 i I extended the path algorithm to 



include all design matrices X, by computing the regularization path of the dual problem. Although fairly 
general, this version of the path algorithm does not scale well with data dimension, as the knots of the 
piecewise linear solution path become very dense. Many of the proposed approaches are versions of the 



operator splitting methods or their dual versions, alternating direction methods (see, e.g., Boyd et al. (2010); 



Combettes and Pesquet (2010), and the references therein). Although fairly general and universal, they 



frequently suffer from slow tail convergence (see (IHe and Yuan||2011|) and the references therein). 



Thus, a need arises to develop a general approach that can solve large scale structured regularization 
problem efficiently. For such an approach to be successful in practice, it should guarantee to converge at a 
fast rate, be able to handle massive data sets, and should not rely on approximating the penalty function. In 
this paper, we propose a framework based on the alternating linearization algorithm of ( |Kiwiel et al.| 1999 1, 
that satisfies all these requirements. 

Formally, we write the objective function as a sum of two convex functions, 

C(P) = f(P) + h(j3), (3) 

where f(/3) is a loss function, which is assumed to be convex with respect to f3, and h(-) is a convex penalty 
function. Any of the functions (or both) may be nonsmooth, but an essential requirement of our framework 
is that each of them can be easily minimized with respect to /3, when augmented by a linear-quadratic term 
Sf=i { s ifii + di(3f), with some vectors s,d G W, d > 0. Our method bears resemblance to operator 
splitting and alternating direction approaches, but differs from them in the fact that it is monotonic with 
respect to the values of (T3]>. We discuss these relations and differences later in section |2T2~j but roughly 



speaking, a special test applied at every iteration of the method decides which of the operator splitting 
iterations is the most beneficial one. 

In our applications, we focus on the quadratic loss function /(•) and the penalty function in the form 
of generalized lasso Q, as the most important case, where comparison with other approaches is available. 
This case satisfies the requirement specified above, and allows for substantial specialization and acceleration 
of the general framework of alternating linearization. In fact, it will be clear from our presentation that any 
convex loss function /(•) can be handled in exactly the same way. 

An important feature of our approach is that problems with the identity design matrix are solved exactly 
in one iteration, even for very large dimension. 

The remainder of the paper is organized as follows. In Section[2j we introduce the alternating linearization 
method and we discuss its relations to other approaches. Section [3] briefly discusses the application to lasso 
problems. In section [4] we describe the application to generalized lasso problems. Section [5] presents sim- 
ulation results and real data examples, which illustrate the efficacy, accuracy, and scalability of the alternating 
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linearization method. Concluding remarks are presented in section [6j The appendix contains details about 
the algorithms used to solve the subproblems of the alternating linearization method. 



2 The alternating linearization method 
2.1 Outline of the method 

In this section, we describe the alternating linearization (ALIN) approach to minimize Q. It is an iterative 
method, which generates a sequence of approximations {/3 k } converging to a solution of the original prob- 
lem (3 1, and two auxiliary sequences: {/^} and {Pj}, where k is the iteration number. Each iteration of the 
ALIN algorithm consists of solving two subproblems: the h-subproblem and the f-subproblem, and of an 
update step, applied after any of the subproblems, or after each of them. 

At the beginning we set = /3°, where /3° is the starting point of the method. In the description below, 
we suppress the superscript k denoting the iteration number, to simplify notation. 

The ft-subproblem 

We linearize /(•) at (3f, and approximate it by the function 

f(P) = f0f) + sJ(p-p f ). 

If /(•) is differentiable, then sj = Vf(f3f); for a general convex /(•), we select a subgradient Sf £ df(pf). 
In the first iteration, this may be an arbitrary subgradient; at later iterations special selection rules apply, as 
described in Q below. 

The approximation is used in the optimization problem 

mm f( p ) + h(P) + l\\p-P\\ 2 D , (4) 
in which the last term is defined as follows: 

II/?-/3|Id = (/3-/3) T £(/3-/3), 

with a diagonal matrix D = diag{(fj, j = 1, . . . ,p},dj > 0,j = l,...,p. The solution of the /i-subproblem 
Q is denoted by f3h- 

We complete this stage by calculating the subgradient of h(-) at (3^, which features in the optimality 
condition for the minimum in Q: 

Oea f + dh(P h )+D0 h -p). 
Elementary calculation yields the right subgradient Sh G dh((3h)'- 

s h = -s f -D0 h -p). (5) 
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The /-subproblem 

Using the subgradient we construct a linear minorant of the penalty function h(-) as follows: 

h(p) = h0 h ) + 8%(p - p h ). 
This approximation is employed in the optimization problem 

mm f(p) + h{p) + \\\P-pf D . (6) 

The optimal solution of this problem is denoted by $f. It will be used in the next iteration as the point 
at which the new linearization of /(•) will be constructed. The next subgradient of /(•) to be used in the 
/i-subproblem will be 

8 f = -8 h -D0 f -p). (7) 

The update step 

The update step can be applied after any of the subproblems, or after both of them. It changes the current 
best approximation of the solution p, if certain improvement conditions are satisfied. It uses a parameter 
7 G (0, 1). We describe it here for the case of applying the update step after the /-subproblem; analogous 
operations are carried out if the update step is applied after the /i-subproblem. 
At the beginning of the update step the stopping criterion is verified. If 

f0 f ) + h0 f )>f0) + h0)-s, (8) 

the algorithm terminates. Here e > is the stopping test parameter. 
If the the stopping test is not satisfied, we check the inequality 

f(P f ) + h{fif) < (1 - 7) [f0) + HP)] + 7 [fWf) + HPf)] ■ (9) 

If it is satisfied, then we update P <— j3f, otherwise ft remains unchanged. 

If the update step is applied after the h- subproblem, we use Ph instead if /3f in the inequalities ^ and @. 

The update step is a crucial component of the alternating linearization algorithm; it guarantees that 
the sequence {C(P k )} is monotonic, and it stabilizes the entire algorithm (see the remarks at the end of 



section 5.1 1 



2.2 Relation to operator splitting and alternating direction methods 

Our approach is intimately related to operator splitting methods and their dual versions, alternating direction 
methods, which are recently very popular in the area of signal processing (see, e.g., (Boy d et"aL| 2010[ 



Combettes and Pesquet||2010) [Fadili an dPeyre]|2011| l). To discuss these relations, it is convenient to present 
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our method formally, and to introduce two running proximal centers: 



z f = P-D- 1 s f , 
z h = P-D- 1 s h . 

After elementary manipulations we can absorb the linear terms into the quadratic terms and summarize the 
alternating linearization method as follows. 

Algorithm 1 Alternating Linearization 
l: repeat 

2: ^«s-argmin{/i(/3) + \\\P-z f f D ) 
3: if {Update Test for ph) then 
4: $<^p h 
5: end if 

6: Z h -<r- j3 + j3 h - Zf 

7: p f <-axgmm{f(P) + ±\\P-z h \\l} 
8: if (Update Test for pf) then 
9: P^Pf 

10: end if 

11: Zf^P + Pf-Z h 
12: until (Stopping Test) 



The Update Test in lines [3] and [8] is the corresponding version of inequality Q. The Stopping Test is 
inequality ([8]). 

If we assume that the update steps in lines [4] and [9] are carried out after every /i-subproblem and every 
/-subproblem, without verifying the update test Q, then the method becomes equivalent to a scaled version 
of the Peaceman-Rachford algorithm (originally proposed by (Peaceman and Rachford] |1955 1 for PDEs and 
later generalized and analyzed by ( [Lions and Mercier||1979[ ); see also ( |Combettesl |2009| ) and the references 
therein). If D = pi with p > 0, then we obtain an unsealed version of this algorithm. 

If we assume that the update steps are carried out after every /i-subproblem without verifying inequali- 
ty Q, but never after /-subproblems, then the method becomes equivalent to a scaled version of the Douglas- 
Rachford algorithm (introduced by (Douglas and Rachford, 1956 ), and generalized and analyzed by ( [Lions 



and Mercler] 1979 1; see also (Bauschke and Combettes 2011| ) and the references therein). As the roles of / 



and h can be switched, the method in which updates are carried always after /-subproblems, but never after 
/i-subproblems, is also equivalent to a scaled Douglas-Rachford method. 

Operator splitting methods are not monotonic with respect to the values of the objective function C(P). 
Their convergence is based on monotonicity with respect to the distance to the optimal solution of the problem 
( |Lion s and Mercier} [1979[ |Ecks tein and Bertsekas"l |1992| ). 

In contrast, the convergence mechanism of our method is different; it draws from some ideas of bundle 



methods in nonsmooth optimization ( |Hiriart-Urruty and Lemarechal[ |1993[ |Kiwiel[ [1985; Rus zczynski[ 
2006). Its key element is the update test employed in At every iteration we decide whether it is beneficial 
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to make a Peaceman-Rachford step, any of the two possible Douglas-Rachford steps, or none. In the latter 
case, which we call the null step, j3 remains unchanged, but the trial points /3/j and fit are updated. These 
updates continue, until (3^ or fit become better than /3, or until optimality is detected (cf. the remarks at the 



end of section 5.1 ) 



Alternating direction methods are dual versions of the operator splitting methods, applied to the following 
equivalent form of the problem of minimizing Q: 



min/(/3i) + h((3 2 ), subject to fix = (3 2 . 



(10) 



In regularized signal processing problems, when f(/3) = ip(X/3) with some fixed matrix X, the convenient 
problem formulation is 

min ip(v) + h(/3), subject to v = X/3. 

The dual functional has the form of a sum of two functions, and the operator splitting methods apply. The 
reader may consult ptoyd et aL 2010} Combettes and Pesquet| 2010) for appropriate derivations. It is also 
worth mentioning that the alternating direction methods are sometimes called split Bregman methods in the 
signal processing literature (see, e.g., (Goldstein and Osher 2009 Ye and Xie) |201 1 1, and the references 
therein). 

However, to apply our alternating linearization method to the dual problem, we would have to be able to 
quickly compute the value of the dual function, in order to verify the update condition (|9]), as discussed in 
detail in ( |Kiwiel et al. 1999). This is the reason for our preference toward the primal version of the method. 



2.3 Convergence 

Convergence properties of the alternating linearization method follow from the general theory developed in 
(Kiwiel et al. 1999 1. Indeed, after the change of variables £ = D l l 2 f3 we see that the method is identical 
to Algorithm 3.1 of (|Kiwiel et al.||1999]), with pk = 1. The following statement is a direct consequence of 



dKiwiel etal.|[T999l Theorem 4.8). 

Theorem 1 Suppose that the set of minima of the function^ is nonempty. Then the sequence {/3 k } generated 
by the algorithm is convergent to a minimum point f3* of the function ([3]). Moreover, every accumulation point 
(s*p s^) of the sequence {(sh, s^)} satisfies the relations: £ df(/3*), 6 dh{(3*), and s*^ + s* h = 0. 

For structured regularization problems the assumption of the theorem is satisfied, because both the loss 
function /(•) and the regularizing function h(-) are bounded from below, and one of the purposes of the 
regularization term is to make the set of minima of the function £(•) nonempty and bounded. 



3 Application to lasso regression 



First, we demonstrate the alternating linearization algorithm (ALIN) on the classical lasso regression problem. 
Due to the separable nature of the penalty function, very efficient coordinate descent methods are applicable 
to this problem as well ( Tseng 200 1| ), but we wish to illustrate our approach on the simplest case first. 
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In the lasso regression problem we have 



f(J3) = l\\y-Xp\\l h(P) = XWPh, 

where X is the n x p design matrix, y 6 R n is the vector of response variables, (3 6 W is the vector of 
regression coefficients, and A > is a parameter of the model. 

We found it essential to use D = diag(X T X), that is, dj = XjXj, j = 1, . . . ,p. This choice is related 
to the diagonal quadratic approximation of the function f(/3) = ^\\y — -X73|||, which was employed (for 



similar objectives in the context of augmented Lagrangian minimization) by Ruszczyhski ( 1995 ). Indeed, in 
the /i-subproblem in the formula ( fTT| ) below, the quadratic regularization term is a quadratic form built on 
the diagonal of the Hessian of /(•). 

The ft-subproblem 

The problem ([4]), after skipping constants, simplifies to the following form 

min sp + XWPh + lWP-PWl. (11) 

with Sf = X T (Xj3f — y). Writing Tj = /3 — Sfj/dj, we obtain the following closed form solutions of §n\ , 
which can be calculated component-wise: 

j3 hj = sgn(Tj)max (o,|Tj| - — j = l,...,p. (12) 



The subgradient of h{-) at is calculated by §5^. 
The /-subproblem 

The problem ([6]), after skipping constants, simplifies to the unconstrained quadratic programming problem 

mm s T h P + \\y- X$f 2 + - frf D . (13) 

Its solution can be obtained by solving the following symmetric linear system in 5 = f3 — (3: 

(X T X + D)5 = X T (y-Xp)-s h . (14) 



This system can be efficiently solved by the preconditioned conjugate gradient method (see, e.g., (Golub and 
1996 1), with the diagonal preconditioner 2D = 2diag(X T X). Its application does not require 



Van Loan 



the explicit form of the matrix X T X; only matrix- vector multiplications with X and X T are employed, and 
they can be implemented with sparse data structures. 
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4 Application to generalized lasso regression 



In the following we apply the alternating linearization algorithm to solve the generalized lasso problem Q. 
Here we assume the least squares loss, as in the previous subsection. The objective function can be written 
as follows: 

C(J3) = f{fi) + h(P) = |||y - X(3f 2 + XWRPh. (15) 
For example, for the one-dimensional fused lasso, R is the following (p — 1) x p matrix: 



-1 1 ... 
-1 1 ... 



... - 
but our derivations are valid for any form of R. 

The ft-subproblem 

The ft,-subproblem can be equivalently formulated as follows: 



-1 1 



min sJ/3 + A||z||i + !l|/3-/?||i3 subject to Rj3 = z. 



(16) 



Owing to the use of D = diag(X T X), and with Sf = X T (Xf3 — y), it is a quite accurate approximation of 
the original problem, especially for sparse X ( Ruszczy nski| 1 1 995) . 



The Lagrangian of problem ( 16 1 has the form 

L(p, z, fj) = + A||z||i + n T (RP - z) + || 



\d, 



where fj, is the dual variable. We see that the minimum of the Lagrangian with respect to z is finite if and 
only if 1 1// 1 loo < A. Under this condition, the minimum value of the z-terms is zero and we can eliminate 
them from the Lagrangian. We arrive to its reduced form, 



L(p,ti) = s T f fi + n T Rf3 + \ 



(17) 



To calculate the dual function, we minimize n) over /3 £ W. After elementary calculations, we obtain 
the solution 

P h = P-D-\s f + R T t i). (18) 



Substituting it back to ( 17 1, we arrive to the following dual problem: 

max -l{i T RD~ 1 R T fi + fi T R0 - D~ l s f ) subject to HmIIoo < A. (19) 
This is a box-constrained quadratic programming problem. It can be efficiently solved by the active-set 
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box-constrained preconditioned conjugate gradient algorithm with spectral projected gradients, as described 



in ( Birgin and Martinez 2002 Friedlander and Martinez 1994). The diagonal of the matrix RD 1 R T is a 



good preconditioner for this method in the applications that we dealt with. We summarize the most important 
operations of this method in the Appendix. It should be stressed that its application does not require the 
explicit form of the matrix RD~ 1 R T ; only matrix- vector multiplications with R and R T are employed, and 
they can be implemented with sparse data structures. 



The solution ft of the dual problem can be substituted into ( 18 1 to obtain the primal solution. 



The /-subproblem 



We obtain the update (3f by solving the linear equation system ( 14 1, exactly as in the lasso case. 



The special case of X = I 



If the design matrix X = I in ( 15 1, then our method solves the problem in one iteration, when started from 
(3 = y. Indeed, in this case we have Sf = 0, D = I, and the first /i-subproblem becomes equivalent to the 



original problem ( [15] ) 

i t \ i • i \ 1 1 - ■ 1 1 . ._ L _ 



min A || ^ || i + l\\f3 - y\\% subject to R/3 = z. (20) 



The dual problem ( 19 ) simplifies as follows: 

max — i/z RR /z + n Ry subject to \m Lq < A. (21) 

It can be solved by the same conjugate gradient method with bounds, with the preconditioner equal to the 
diagonal of the matrix RR T . The optimal primal solution is then fth = y — i? T /x. 



5 Numerical experiments 

In this section, we present results on a number of simulation and real data studies involving a variety of non 
differentiable penalty functions. We compare the alternating linearization algorithm (ALIN) with competing 
approaches in terms of iteration steps, computation time and estimation accuracy. All these studies are 
performed on an AMD 2.6GHZ, 4GB RAM computer using MATLAB. 



5.1 Simulation studies 

In this experiment, we compare the ALIN algorithm with three different approaches using data sets generated 
from a linear regression model y = YJj=i x i@j + e w i tn pre- specified coefficients (3j and varying dimension 
p. Xj is drawn from the normal distribution with zero mean and unit variance, e is the noise vector that is 
generated from the normal distribution with zero mean and variance equal to 0.01. Among these coefficients, 
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10% equal 1, 20% equal 2, and the rest are zeros. For instance, with p = 100, we may have 



Pi 



1 forj = 11,12, ...,20, 

2 for j = 20,..., 49, 
otherwise. 



Table[T]reports the run times of ALIN and three competing algorithms: the generic quadratic programming 
solver (SQOPT), the alternating direction (split Bregman) method (BREGMAN) of ( |Ye and Xie"l|2011| ) and 
Nesterov's subgradient method (SLEP) of ( |Liu et al.| 2011 Nesterov 2007 1. In this study, we fix the sample 
size to 1000 and vary the dimension of the problem from 1000 to 50000. Each method is repeated 10 times 
and the average is reported. The Split Bregman method and SLEP are run to optimal solution corresponding 
to the stopping criteria built in the packages. We stop ALIN runs when the objective function value attained is 
as good as that of SLEP. Judging from these results, ALIN clearly outperforms the other methods in terms of 
speed for most cases. The improvements on run time range from 1 .5 to 3 folds depending on the experimental 
setting, and becomes more significant when the data dimension grows higher. This is particularly significant 
in view of the fact that ALIN was implemented as a MATLAB code, as opposed to the executables in the 
other cases. 



Table 1: Run time comparison between SQOPT, BREGMAN, SLEP and ALIN (in seconds). 



Problem size 


SQOPT 


BREGMAN 


SLEP 


ALIN 


n = 


= 1000, p = 100, A = 


= 0.1 


0.2 


15.2 


0.02 


0.23 


n = 


1000, p = 1000, A : 


= 0.1 


9.8 


103 


0.7 


7 


n = 


1000, p = 5000, A = 


= 0.1 


940 


471 


31 


17.6 


n = 


1000, p = 10000, A 


= 0.1 


NA 


879 


126.7 


92.7 


n = 


1000, p = 20000, A 


= 0.1 


NA 


2133 


489 


433.5 


n = 


1000, p = 50000, A 


= 0.1 


NA 


3633 


1401.6 


666.2 



It is interesting to compare the approximation of the optimal solution found by the methods, as depicted 
in Figure [T] 

Next we investigate how our method approaches the optimal objective function value compared to other 
methods. Using the above simulated data set with n = 1000, p = 5000, and A = 0.1, we run three methods 
to convergence and plot the objective function value attained along the iterations. We obtain the optimal 
value C* from SQOPT. At each iteration, we can calculate the difference between the optimal value and 
the current function value. The plots in Figure [2] demonstrate how far the methods are from the optimal 
value along the iterations. We also provide in Figure [3] the dependence of the running time of ALIN on 
the dimensions of the problem, to illustrate its scalability. The efficiency of the method is due mainly to its 
good convergence properties, but also to the efficiency of the preconditioned conjugate gradient method for 
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Figure 1: Results of using fused lasso penalty on a simulated data set with n = 1000, p = 5000, A = 0.1. 
Plots (a), (b), (c), and (d) correspond to the original /3, results from ALIN, SLEP, and Bregman, respectively. 

solving the subproblems. It employs sparse data structures and converges rapidly. Usually, between 10 and 
20 iterations of the conjugate gradient method are sufficient to find the solution of a subproblem. 

The update test @ is an essential element of the ALIN method. For example, in a case with n = 1000, 
p = 5000, and A = 0.1, the update of (3 occurred in about 80% of the total of 70 iterations, while other 
iterations consisted only of improving alternating linearizations. If we allow updates of /3 at every step, the 
algorithm takes more than 5000 iterations to converge in this case. Similar behavior was observed in all other 
cases. These observations clearly demonstrate the difference between the alternating linearization method 
and the operator splitting methods. 

5.2 CGH data example 

In this study we present the results on analyzing the CGH data using fused lasso penalty. CGH is a technique 
for measuring DNA copy numbers of selected genes on the genome. The CGH array experiments return the 
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Figure 2: Simulated data set with n = 1000, p = 5000, A=0.1. Plots (a), (b), and (c): ln(Error) versus 
iteration number of SLEP, ALIN, and Bregman, respectively. 
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(b) Running time as the number 
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Figure 3: Running time of ALIN as dimensions change. 
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Figure 4: Fused lasso applied to CGH data, A = 3. 



log ratio between the number of DNA copies of the gene in the tumor cells and the number of DNA copies in 
the reference cells. A value greater than zero indicates a possible gain, while a value less than zero suggests 
possible losses. Tibshirani and Wang (2008 ) applied the fused lasso signal approximator for detecting such 
copy number variations. 

This is a simple one-dimensional signal approximation problem with the design matrix X being the 
identity matrix. Thus the advantage of ALIN over the other three methods is not significant, due to the 
overhead that ALIN has during the conjugate gradient method implemented in MATLAB. Indeed the solution 
time of ALIN is comparable to that of Bregman and SLEP. 

Figure[4]presents the estimation results obtained by our ALIN method. The green dots shows the original 
CNV number, and the red line presents the fused lasso penalized estimates. 



5.3 Total variation based image reconstruction 

The one-dimensional fused lasso can be naturally extended to two-dimensional cases for image processing. 
In this section we present several experiments on image denoising and deblurring using two-dimensional 
fused lasso penalty. Although of similar forms, higher-order fused lasso is fundamentally different from 
the one-dimensional fused lasso, as the structural matrix R defined in eq. (|2]) is singular. This additional 
complication introduces considerable challenges in the path type algorithms (Tibshirani and Taylor 2011 



and additional computational steps need to be implemented to guarantee convergence. 

The ALIN algorithm does not suffer from complications due to the singularity of R, because the dual 
problem ( TP?] ) is always well-defined and has a solution. Even if the solution is not unique, ( fT%| ) is still an 
optimal solution of the /i-subproblem, and the algorithm proceeds unaffected. 

Let y be an m x n observed noisy image; one attempts to minimize the following objective function: 



„4(/3)|| 2 + AM/3), 
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(a) the original image (b) the noisy image (c) the denoised image 

Figure 5: Results of denoising using fused lasso penalty with A=0.05. Plots (a), (b) and (c) correspond to 
the original image, the noisy image, and the de-noised image, respectively 

where 

m— In— 1 m—l n—1 

i=i j=i i=i j=i 

and .4 : R mxn — > R mxn is a linear transformation. When A is the identity transformation, the problem is 
to denoise the image y. The penalty function h(/3) is similar to the total variation (TV) penalties widely used 
for image denoising and deblurring. 

In the following experiments, we apply the fused lasso penalty to recover noisy and blurred images to 
their original forms. The two images considered are of size 256 by 256, which results in solving very large 
fused lasso problems (the matrix R has dimensions of about 262000 x 66000). 

In the denoising example, noise drawn from the normal distribution with zero mean and standard deviation 
0.02 is added to the cameraman picture. In Figure[5]we show the denoising result. Clearly, the ALIN algorithm 
is able to recover the original image from the noisy image fairly well. This is due to the fact that the two- 
dimensional fused lasso penalizes the difference between neighboring pixel values and effectively smoothes 
out the image and eliminates noise. As A = I in this case, consistent with the observations made at the end 
of section [3J the ALIN method solves the denoising problem in one iteration. 

In the deblurring example, we first blur the image, by replacing each pixel with the average of its neighbors 
and itself. This operation defines the kernel operator A used in the loss function | \\y — A(/3) || 2 . Then we 
add noise as before. The deblurring results are shown in Figure [6] and similar effects are observed. 

Neither Split Bregman, nor SLEP is directly applicable to two-dimensional fused lasso problem. The 
ALIN method is able to solve this problem in 2502 seconds and 9 iterations, which clearly demonstrates its 
scalability. We may also inspect the de -blurred image produced by the gradient based algorithm for image 
deblurring (FISTA). In our experiments, we run ALIN and FISTA on the blurred image with two different 
values of A and we check whether FISTA can attain the same objective function value as ALIN. It turned out 
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(a) the original image 

(b) the blurred image 

(c) ALIN deblurred image 

(d) FISTA deblurred image 



(c) (d) 

Figure 6: Results of deblurring using fused lasso penalty. Plots (a), (b), (c), and (d) correspond to the original 
image, the blurred image, the ALIN de-blurred image, and the FISTA d-blurred image, respectively. 
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that FISTA could not do so even after 10000 iterations. The results are in Table[2] 

Table 2: Run time comparison on image deblurring. 



Parameter 


ALIN 


FISTA 




Function value Running time Iterations 


Function value Running time Iterations 


A = 0.001 
A = 0.005 


3.13 2800 9 
11.3 2848 8 


5.28 2502 > 10000 
20.52 2760 > 10000 



5.4 Application to a narrative comprehension study for children 

With high dimensional fused lasso penalty, the constrained optimization problem with identity matrix is 
already difficult to solve, and a large body of literature has been devoted to solving this problem. However, 
there has not been a systematic treatment of penalized regression using non-separable penalty functions with 
an arbitrary design matrix X that may have non-full rank. 

In the following we present the results of penalized regression between the measurement of children's 
language ability (the response y) and voxel level brain activity during a narrative comprehension task (the 
design matrix X). Children develop a variety of skills and strategies for narrative comprehension during 
early childhood years ( |Karunanayaka et aL 2010 1. This is a complex brain function that involves various 



cognitive processes in multiple brain regions. We are not attempting to solve the challenging neurological 
problem of identifying all such brain regions for this cognitive task. Instead, the goal of this study is to 
demonstrate ALIN's ability for solving constrained optimization problems of this type and magnitude. 



The functional MRI data are collected from 313 children with ages 5 to 18 (Schmithorst et al. 2006). 
The experimental paradigm is a 30-second block design with alternating stimulus and control. Children are 
listening to a story read by adult female speaker in each stimulus period, and pure tones of 1 -second duration 
in each resting period. The subjects are instructed to answer ten story -related multiple-choice questions upon 
the completion of the MRI scan (two questions per story). The fMRI data were preprocessed and transformed 
into the Talairach stereotaxic space by linear affine transformation. A uniform mask is applied to all the 
subjects so that they have measurements on the same set of voxels. 

The response variable y is the oral and written language scale (OWLS). The matrix X records the activity 
level for all the 8000 voxels measured. The objective function is the following: 

£(P) = l\\y-XP\\1 + \h(P), 



17 



where 

m— 1 n— 1 p— 1 

i=i j=i fe=i 

m-lp-l n— 1 

i+l,n,fcl + |A,n,fc _ Pi,n,k+l\} + ^{l^mj.p ~ Pm,j+l,p\} 
i=l fc=l 3=1 
n-lp-l m— 1 

+ ^ ] ^ X l^"ii3ifc ~~ /^n,j+l,fc| + |/3mj',fc — Pm,j,k+1 i+l,n,p|} 
j=l k=l i=l 

m— In— 1 p— 1 

+ E Eil^J.P ~~ &+l,i,pl + lAj,p - A,i+l,p|} + J ^2{\Pm,n,k - Pm,n,k+l\}, 
i=l j=l k=l 

and m = 31, n = 35, and p = 15. 

Clearly, the design matrix does not have full rank as the dimension is far greater than the sample size. 
None of the methods that we compared previously work in this setting with a 3-d fused lasso penalty. The 
ALIN algorithm terminates in 44 steps and the run time is 3200 seconds. 





Figure 7: Results of regularization regression with combined lasso and 3-d lasso penalty. The tuning 
parameters of fused lasso is 0.2 for both figures. The tuning parameter for lasso is 0.2 for the left and 0.6 for 
the right. 

While the main purpose of this study is to demonstrate the capability of the ALIN algorithm for solving 
penalized regression problems with 3-d lasso, there are also some interesting neurological observations. One 
objective of this study is to identify the voxels that are significant for explaining the performance score y. 
These voxels constitute active brain regions that are closely related to the OWLS. Figure|7]presents the results 
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of fitted coefficients using combined lasso and fused lasso penalty. The highlighted regions shown in the 
maps are areas with more than 10 voxels (representing clusters of size 10 and above). The left plot in the 
figure is the optimal solution obtained using ten-fold cross validation. The optimal tuning parameters are 
0.2 for both fused lasso and lasso penalties. Roughly speaking, five brain regions have been identified. The 
yellow area to the rightmost side of the brain is situated in the Wernicke area, which is one of the two parts 
of the cerebral cortex linked to speech. It is involved in the understanding of written and spoken language. 
The only difference between the left and right plots is the value of the tuning parameter for the lasso penalty, 
which is 0.2 and 0.6 respectively. Clearly, the right plot shrinks more coefficients to zero, which results in a 
reduced number of significant regions, as compared to the left plot. 




Figure 8: Results of regularization regression with lasso penalty only. The tuning parameter is 0.2. The 
cluster sizes are 1 and 10 for the left and right plot respectively. 

We further study this regularization problem using only lasso penalty. Figure [8] shows the fitted maps. 
The lasso tuning parameter is 0.2 for both plots. The left plot is of cluster size 1, and the right plot uses 
cluster size 10. From the right plot, clearly none of the areas identified have more than 10 voxels. Comparing 
this with Figure[7] we see that the 3-d fused lasso penalty imposes smoothing constraints on the neighboring 
coefficients, thus allowing to identify larger areas significant for the response variable y. The simple lasso 
penalty imposes shrinkage on the coefficients individually, resulting in rather disjoint significant voxels. Such 
scatterness is much less informative for neurologists than larger areas identified by the three-dimensional 
fused lasso penalty. 
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6 Conclusion 



The alternating linearization method is a specialized nonsmooth optimization method for solving structured 
nonsmooth optimization problems. It combines the ideas of bundle methods and operator splitting methods, 
to define a descent algorithm in terms of the values of the function that is minimized. We have adapted the 
alternating linearization method to structured regularization problems by introducing the idea of diagonal 
quadratic approximation and developing specialized methods for solving subproblems. As a result, a new 
general method for a variety of regularization problems has been obtained, which has the following theoretical 
features: 

• It deals with nonsmoothness directly, not via approximations, 

• It is monotonic with respect to the values of the function that is minimized, 

• Its convergence is guaranteed theoretically. 

Our numerical experiments on a variety of structured regularization problems illustrate the applicability 
of the alternating linearization method and indicate its practically important virtues: speed, scalability, and 
accuracy. It clearly outperforms extant methods, and it can solve problems which were unsolvable otherwise. 

Its efficacy and accuracy follow from the use of the diagonal quadratic approximation and from a special 
test, which chooses in an implicit way the best operator splitting step to be performed. The current approxi- 
mate solution is updated only if it leads to a significant decresase of the value of the objective function. 

Its scalability is due to the use of highly specialized algorithms for solving its two subproblems. The 
algorithms do not require any explicit matrix formation or inversion, but only matrix-vector multiplications, 
and can be efficiently implemented with sparse data structures. 



Our study of narrative comprehension for children in section 5.4 is an illustration of broad applicability 
of the alternating linearization method. It is, to the best of our knowledge, the first successful method for 
this three-dimensional fused lasso problem. 
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Appendix 

Preconditioned conjugate gradient method 

The preconditioned conjugate gradient method is used in our method to solve quadratic optimization problems 
of the following form: 

min f(x) = \x T Ax — b T x, 

x&F 

where x 6 IR™, A is a positive semidefinite matrix of dimension n x n, b G R n , and F is a box in IR™. 

In the /-subproblem fB] ), x corresponds to the difference 5 = (3 — f5, A = X T X + diag(X T X) and is 
positive definite, b = X T (y — X/3) — Sh, and F is the whole space. In this case we use the preconditioner 
M = 2 diag(X T X) . The method always starts from x° = 0, which corresponds to f3 = f3. 



lz?T 



In the dual problem ( fT9| ) of the h- subproblem, x corresponds to the multipliers p,, A = RD R 



b = R(j3 — D 1 Sf) and F is a closed face of the set 

ft = {x e IT : \\x\loo < A}. (22) 

We use the preconditioner M = dia.g(RD~ l R T ). The method is a part of a more complicated algorithm to 
be described in the next section. It operates on a subspace of the space of multipliers, that is, x is a subvector 
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of fi, and the other matrices and vectors are corresponding submatrices and subvectors of A, b, and M. We 
still use the same notation; it will not lead to misunderstandings, we hope. The method starts from or from 
the best point found so far. 

Algorithm 2 Preconditioned Conjugate Gradient Method 

1: g° <- Ax° -b,z ^- M~ 1 g°, d° <- -z°, k^O 
2: repeat 

{gk) T z k 



(d k ) T Ad k 
if (x k+1 $ F) then 



x k+i x k + Tk( jk 



T k «- max{r : x k + rd k G F} 
x k «- x k + T k d k 
return 
end if 

g k+l ^gk + TkM k 

z k+i M^g k+1 

( Z k +Y(g k ^-g k ) 



12: Ctfc i u\ r r u 

[z K ) x g K 



d k+i = _ z k+i + akd k 

k <r- k + 1 

until { (\\g k \\ < e) or {Leave Face) } 



If the method is applied to an unconstrained problem and satisfies the stopping test in line 15 then the 
point x k is an approximation of the solution of the problem. If the stopping test \\g k \\ < £ is satisfied in 
the box-constrained problem, the optimal solution in the current face F is found. The condition Leave Face 
verifies whether it is beneficial to leave the current face F before reaching optimality. We describe it in 
more detail in the next section. In general, further operations are performed to change the face over which 
optimization is performed, or to detect optimality. If the method reaches the boundary of the face F in line 
[5] and returns, other operations are performed to change the face over which optimization is performed. In 
both cases, the point x k becomes the starting point for the next phase of the method, and the index k is reset 
to 0. We also reset the method in the case when (g k ) T z k < 0. 

Active-set box-constrained algorithm with spectral projected gradients 

Active-set box-constrained algorithm (GENCAN) minimizes the function f{x) on the box SI defined in (T221). 



The feasible region Q is divided into disjoint faces by specifying active constraints for each face. For each 
partition Fl = {I-,Io, J + } of the set I = {1, 2, . . . , n} into three disjoint sets, so that I = J_ U Jo U /+, we 
specify the corresponding face as 

F n = {x e : Xi = -A if % G J_, Xi = A if i G I+, -A < xt < A if i G Jo}. 



24 



Within the current face we run the conjugate gradient method, as specified above, by replacing at each point 
x G F n the gradient g = Vf(x) by the "gradients within the face": 



9? 



ifi€l_UJ + , 
gi ifielo, 



If the conjugate gradient method reaches the boundary of the current face F and exits in line [8} then one or 
more indices i £ Iq are moved to either I_ or I + , depending on whether x\ = — A or x\ = A. After that, 
the current point x k becomes the starting point x° for the next pass of the conjugate gradient method in the 
new face. 

At the end of each iteration of the conjugate gradient method we verify the following Leave Face test. 
The "projected gradient" at x G Q is defined as: 



if i G I- and gi > or i G I + and gi < 0, 

i = 1, ... ,n. 

gi otherwise, 



If \\g || = 0, then the optimal solution has been found. Otherwise, if 

\\g n \\<v\\g p l (23) 

where r] G (0, 1) is a fixed constant, then we decide that it is beneficial to leave the face F (the Leave Face 
test is true). In the latter case, for a "spectral coefficient" aj. > 0, we calculate the direction 

d k = P n (x k - a k g) - x k , 

where Pn(-) is the operation of the orthogonal projection on 17. The calculation of a^, which measures the 
curvature of the objective function along the last direction, is essential here; in a typical case 

__ (x fc -x fc - 1 ) T (( 7 fc - 5 fc - 1 ) 

Next, having the current point x k and the direction d k , we carry out the constrained line search: 

T fc = argmin{/(x fc + rd k ) : r > 0, x k + rd k G Q}, x k+1 = x k + T k d k . 



Due to condition ( [23] ), this step will always result in the change of the current face. After that, the new face 
F is determined, and the conjugate gradient method re-starts from the point x k+1 , which plays the role of 
the point x° in the new cycle. 



More details about GENCAN can be found in (Birgin and Martinez 2002; Friedlander and Martinez 
T9941). 
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